- Parametrized Models
- Symbols – similar to Factor Graphs
- Bubbles
- Black = observed variables
- Blue = computed variable
- Round blue shape
- Arrow direction = the direction in which the function is easy to compute
- Deterministic functions
- Red square
- Cost function
- single scalar output
- Loss Function
- Minimization by gradient based methods
- Can easily find the gradient of a function
- function is differentiable
- almost everywhere
- should be continuous
- can have kinks
- Gradient Descent
- There are algorithms that aren't gradient-based
- e.g. staircase-type cost functions, where the gradient is zero almost everywhere
- or we don't know the function / can't get a gradient
- zeroth-order methods / gradient-free methods
- whole family of these methods
- used in reinforcement learning
- where the cost isn't differentiable
- (cost becomes a black box)
- can apply gradient estimation (perturb the parameters and measure the change in cost)
- very inefficient in high dimensions, where the space to search is huge (see the sketch after this list)
- Can use a critic method (Actor-Critic, A2C, etc.)
- by training a differentiable critic module "C" to estimate the cost function
- Reward is the negative of a cost
- For minibatches, a rough rule of thumb: batch size ≈ number of categories (or 2x)
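- A minimal sketch (not from the original notes) contrasting a gradient-descent update with a finite-difference (zeroth-order) gradient estimate; the quadratic cost and the numbers of dimensions/probes are made up for illustration:

```python
import torch

# Toy differentiable cost: a quadratic bowl (hypothetical stand-in for a real loss).
def cost(w):
    return (w ** 2).sum()

# Gradient descent: one autograd-based update.
w = torch.randn(1000, requires_grad=True)
loss = cost(w)
loss.backward()
with torch.no_grad():
    w -= 0.1 * w.grad        # step in the negative gradient direction
    w.grad.zero_()

# Zeroth-order estimate: probe the black-box cost with random perturbations.
# Each probe gives one noisy directional derivative, so the number of probes
# needed grows with the dimension of w -- this is why it is inefficient.
w0 = torch.randn(1000)
eps, n_probes = 1e-3, 100
grad_est = torch.zeros_like(w0)
for _ in range(n_probes):
    u = torch.randn_like(w0)
    grad_est += (cost(w0 + eps * u) - cost(w0 - eps * u)) / (2 * eps) * u
grad_est /= n_probes
```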
- Neural Nets
- Backprop
- Pytorch
- from torch import nn
- make a class for the net (subclass nn.Module)
- Linear layers (nn.Linear) – see the sketch below
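- A minimal PyTorch sketch of the points above; the class name `Net` and the layer sizes are illustrative, not from the notes:

```python
import torch
from torch import nn

# Minimal network definition: subclass nn.Module and stack linear layers
# with a non-linearity in between (sizes here are made up).
class Net(nn.Module):
    def __init__(self, in_dim=784, hidden=100, out_dim=10):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, out_dim)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return self.fc2(x)          # logits

model = Net()
y = model(torch.randn(32, 784))     # batch of 32 inputs -> (32, 10) logits
```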
- Chain rule for vector functions
- Jacobian Matrix
- The computation graph can be transformed into a second graph that computes the gradients, i.e. backpropagates them (sketch below)
- Can be very complex if the architecture is data-dependent
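- A small sketch of the chain rule for vector functions: backprop's result should equal the Jacobian transpose times the upstream gradient; the toy function `g` is made up for illustration:

```python
import torch
from torch.autograd.functional import jacobian

# Chain rule for vector functions: for y = g(x), c = f(y),
# dc/dx = J_g(x)^T . dc/dy, where J_g is the Jacobian of g.
def g(x):                       # toy vector-to-vector function
    return torch.stack([x[0] * x[1], x[0] + x[1] ** 2])

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = g(x)
c = y.sum()                     # toy scalar cost
c.backward()

J = jacobian(g, x.detach())     # 2x2 Jacobian of g at x
dc_dy = torch.ones(2)           # gradient of c w.r.t. y
print(x.grad)                   # backprop result: tensor([4., 8.])
print(J.T @ dc_dy)              # same values via the explicit Jacobian-transpose product
```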
- Modules used in neural nets
- used because library implementations are optimized (a from-scratch sketch of a few of these follows this list)
- Linear: Y = W.X
- ReLU: y = ReLU(x)
- Duplicate: y1 = x ; y2 = x
- Used when wire splits into two
- Add: y = x1 + x2
- Max: y = max(x1, x2)
- LogSoftMax: y_i = x_i - log(sum_j e^(x_j))
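- A from-scratch sketch of the forward/backward rules for a few of these modules, assuming the convention that `dy` is the gradient of the cost with respect to the module's output:

```python
import torch

def linear_forward(W, x):            # y = W x
    return W @ x

def linear_backward(W, x, dy):       # gradients w.r.t. W and x
    return torch.outer(dy, x), W.T @ dy

def relu_forward(x):                 # y = max(x, 0), elementwise
    return x.clamp(min=0)

def relu_backward(x, dy):            # gradient passes only where x > 0
    return dy * (x > 0).float()

def duplicate_backward(dy1, dy2):    # y1 = x, y2 = x  =>  gradients add up
    return dy1 + dy2

def add_backward(dy):                # y = x1 + x2  =>  same gradient to both inputs
    return dy, dy

def logsoftmax_forward(x):           # y_i = x_i - log(sum_j exp(x_j))
    return x - torch.logsumexp(x, dim=0)
```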
- Softmax
- Sigmoid with targets at its asymptotes doesn't work very well for classification
- the gradient of the sigmoid at its extremes is very small because the curve is flat there
- this leads to the saturation problem
- Solutions
- Set targets in between instead of 1/0 (e.g. 0.8 and 0.2)
- Or take the log of it
- Taking the log of the sigmoid (a small numerical check follows below)
- log sigma(s) = s - log(1 + e^s)
- for large s, the log term ≈ s and the expression goes to 0
- for very negative s, the expression behaves like s, so the gradient stays close to 1
- doesn't saturate! – no vanishing gradients
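- A quick numerical check of the saturation argument, assuming the standard logistic sigmoid:

```python
import torch

# At large |s| the sigmoid is flat, so its gradient is ~0 (saturation).
# The log of the sigmoid, log sigma(s) = s - log(1 + e^s), is asymptotically
# linear on the negative side, so its gradient does not vanish there.
s = torch.tensor([-10.0, 0.0, 10.0], requires_grad=True)

sig = torch.sigmoid(s)
sig.sum().backward()
print(s.grad)            # ~[4.5e-05, 0.25, 4.5e-05] -> saturates at both ends
s.grad.zero_()

logsig = torch.nn.functional.logsigmoid(s)
logsig.sum().backward()
print(s.grad)            # ~[1.0, 0.5, 4.5e-05] -> stays ~1 where s is very negative
```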
- Tricks (several of these are wired together in a sketch after this list)
- Use ReLU non-linearities – works well for many layers (scaling invariant)
- Cross entropy loss – log softmax is a simpler special case
- Stochastic gradient on minibatches
- Shuffle the training samples
- otherwise the network just adapts to the most recent type of input (e.g. if all samples of one class arrive in a row)
- Normalize inputs to 0 mean and unit variance
- can use it on RGB images as well – normalize each channel separately
- since the channels can have very different means
- Schedule a decrease of the learning rate
- Dropout regularization
- L2 -> weight decay at every update
- L(w) = C(w) + alpha * R(w); R(w) = ||w||^2
- leads to shrinking the weights at every iteration
- L1 -> R(w) = sum_i |w_i|
- "lasso"
- least absolute shrinkage and selection operator
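- A sketch wiring several of these tricks together in one PyTorch training loop; the data, network sizes, and hyperparameters are hypothetical:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical data, normalized to zero mean / unit variance per feature.
X = torch.randn(1000, 20)
X = (X - X.mean(0)) / X.std(0)
y = torch.randint(0, 5, (1000,))
# 5 classes -> batch size 10 (the 2x-categories rule of thumb above), shuffled minibatches.
loader = DataLoader(TensorDataset(X, y), batch_size=10, shuffle=True)

model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),       # ReLU non-linearities
    nn.Dropout(p=0.5),                  # dropout regularization
    nn.Linear(64, 5),                   # logits for 5 classes
)
criterion = nn.CrossEntropyLoss()       # cross-entropy on logits (log-softmax inside)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)  # L2 / weight decay
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)  # LR decrease

for epoch in range(30):
    for xb, yb in loader:               # stochastic gradient on minibatches
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
    scheduler.step()
```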
- Additional references
- "Efficient BackProp" (LeCun et al.)
- "Neural Networks: Tricks of the Trade"
- Any directed acyclic graph is ok for backprop
- Lab
- Neural networks alternate rotations (linear transformations) and squashing (non-linearities)
- Draw inputs at the bottom
- Having a high-dimensional intermediate representation is very helpful, or simply use more hidden layers
- because the number of connections grows significantly
- Logit output of final layer
- loss is cross entropy / negative log likelihood
- Choice of activation function is very important
- train a bunch of networks with different initial values – use the variance of their predictions to gauge uncertainty (sketch below)
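- A sketch of that last point: train several copies of the same network from different random initial values and read the spread of their predictions as a rough uncertainty estimate; the architecture and data are made up:

```python
import torch
from torch import nn

def make_net():
    # Same architecture each time; only the random initial values differ.
    return nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))

# Hypothetical regression data with a little noise.
X = torch.randn(200, 2)
y = (2 * X[:, :1] - X[:, 1:]) + 0.1 * torch.randn(200, 1)

ensemble = []
for _ in range(5):                       # a bunch of networks, each with its own initialization
    net = make_net()
    opt = torch.optim.SGD(net.parameters(), lr=0.05)
    for _ in range(200):
        opt.zero_grad()
        loss = ((net(X) - y) ** 2).mean()
        loss.backward()
        opt.step()
    ensemble.append(net)

# Spread of the predictions across the ensemble ~ uncertainty of the model.
x_test = torch.randn(10, 2)
with torch.no_grad():
    preds = torch.stack([net(x_test) for net in ensemble])   # (5, 10, 1)
print(preds.mean(0).squeeze())   # ensemble prediction
print(preds.std(0).squeeze())    # higher std = less certain
```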